Instructions¶

  1. Labeling & Peer Grading: Your homework will be peer graded. To stay anonymous, avoid using your name and label your file with the last four digits of your student ID (e.g., HW#_Solutions_3938).

  2. Submission: Submit both your IPython notebook (.ipynb) and an HTML file of the notebook to Canvas under Assignments → HW # → Submit Assignment. After submitting, download and check the files to make sure that you've uploaded the correct versions. Both files are required for your HW to be graded.

  3. No pdf file required so write all the details in your ipynb file.

  4. AI Use Policy: Solve each problem independently by yourself. Use AI tools like ChatGPT or Google Gemini for brainstorming and learning only—copying AI-generated content is prohibited. You do not neeViolations will lead to penalties, up to failing the course.

  5. Problem Structure: Break down each problem ( already done in most problems) into three interconnected parts and implement each in separate code cells. Ensure that each part logically builds on the previous one. Include comments in your code to explain its purpose, followed by a Markdown cell analyzing what was achieved. After completing all parts, add a final Markdown cell reflecting on your overall approach, discussing any challenges faced, and explaining how you utilized AI tools in your process.

  6. Deadlines & Academic Integrity: This homework is due on 10/01/2024 at midnight. Disclosure of this assignment and assignment answers to anyone or any website is a contributory infringement of academic dishonesty at ISU. Do not share or post course materials without the express written consent of the copyright holder and instructor. The class will follow Iowa State University’s policy on academic dishonesty. Anyone suspected of academic dishonesty will be reported to the Dean of Students Office.

Each problem is worth 25 points. Total $\bf 25\times 4 = 100$.¶

Problem 1.¶

Upload the textdata.csv and preprocess the text excerpts in the text column.

  • Find the various numerical information related to these text excerpts and add them to the textdata.csv as new columns with appropriate labels. The target variable "Bradley-Terry_Score"(https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) is related to the readability of the text excerpt; use the following links to learn more about various other scores and create new columns ( at least 10) with each text excerpts respective score and add the columns in the data. More information on text statistics are in https://pypi.org/project/textatistic and https://pypi.org/project/textstat/
  • Perform feature selection using methods such as correlation analysis, Recursive Feature Elimination (RFE), SelectKBest, or other relevant techniques, considering Bradley_Terry_Score as the target. Display a correlation heat map of the selected features and the target variable.
  • Create multiple linear regression models using Bradley_Terry_Score as the target variable, testing with three different test set sizes: 20%, 25%, and 30%. Cross-validate all models and summarize the test set metrics, including Mean Absolute Deviation (MAD), and R-squared (R²) in a table to identify the best model. Assess the suitability of developing a regression model for this problem, and provide your rationale based on the data and analysis results.
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Upload data
import pandas as pd
import numpy as np
textdata = pd.read_csv("textdata.csv")
textdata.head(10)
Out[2]:
textid text Bradly_Terry_Score
0 c12129c31 When the young people returned to the ballroom... -0.340259
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372
2 b69ac6792 As Roger had predicted, the snow departed as q... -0.580118
3 dd1000b26 And outside before the palace a great garden w... -1.054013
4 37c1b32fb Once upon a time there were Three Bears who li... 0.247197
5 f9bf357fe Hal and Chester found ample time to take an in... -0.861809
6 eaf8e7355 Hal Paine and Chester Crawford were typical Am... -1.759061
7 0a43a07f1 On the twenty-second of February, 1916, an aut... -0.952325
8 f7eff7419 The boys left the capitol and made their way d... -0.371641
9 d96e6dbcd One day he had gone beyond any point which he ... -1.238432
In [3]:
textdata.text[0]
Out[3]:
'When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes.'
In [4]:
# Use count vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english').fit(list(textdata.text))
trainedtext1 = vect.transform(list(textdata.text))
colnames1 = vect.get_feature_names_out()
nparray1 = trainedtext1.toarray()
df1 = pd.DataFrame(nparray1, columns = colnames1)
df1.head(2)
Out[4]:
00 000 000th 001 02 03 034 04 049 06 ... µv ½d ædui ægidus æmilius æneas æolian æquians æschylus ça
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 26526 columns

In [5]:
# Use tfidf vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english').fit(list(textdata.text))
trainedtext2 = tfidf_vect.transform(list(textdata.text))
# As the names are same as count vectorizor, just added "A" to make the 
#feature names different
colnames2 = ["A" + item for item in tfidf_vect.get_feature_names_out()]
nparray2 = trainedtext2.toarray()
df2 = pd.DataFrame(nparray2, columns = colnames2)
df2.head(2)
Out[5]:
A00 A000 A000th A001 A02 A03 A034 A04 A049 A06 ... Aµv A½d Aædui Aægidus Aæmilius Aæneas Aæolian Aæquians Aæschylus Aça
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 26526 columns

In [6]:
# Combines to data frames to one by cancatenating the columns.
df = pd.concat([df1,df2], axis = 1)
df.shape
Out[6]:
(2834, 53052)
In [7]:
!pip install textstat
Requirement already satisfied: textstat in /opt/anaconda3/lib/python3.12/site-packages (0.7.4)
Requirement already satisfied: pyphen in /opt/anaconda3/lib/python3.12/site-packages (from textstat) (0.16.0)
Requirement already satisfied: setuptools in /opt/anaconda3/lib/python3.12/site-packages (from textstat) (74.1.2)
In [8]:
import textstat
from textblob import TextBlob

def find_various_scores(text_data):
    # Readability scores
    fre = textstat.flesch_reading_ease(text_data)
    fkg = textstat.flesch_kincaid_grade(text_data)
    smog = textstat.smog_index(text_data)
    cli = textstat.coleman_liau_index(text_data)
    ari = textstat.automated_readability_index(text_data)
    dcr = textstat.dale_chall_readability_score(text_data)
    dw = textstat.difficult_words(text_data)
    lwf = textstat.linsear_write_formula(text_data)
    gf = textstat.gunning_fog(text_data)
    fh = textstat.fernandez_huerta(text_data)
    sp = textstat.szigriszt_pazos(text_data)
    gp = textstat.gutierrez_polini(text_data)
    craw = textstat.crawford(text_data)
    gi = textstat.gulpease_index(text_data)
    osman = textstat.osman(text_data)
    syllable_count = textstat.syllable_count(text_data)
    character_count = textstat.char_count(text_data)
    word_count = textstat.lexicon_count(text_data, removepunct=True)
    sentence_count = textstat.sentence_count(text_data)
    words = text_data.split()
    lexical_density = len([word for word in words if word.isalpha()]) / len(words)
    ttr = len(set(words)) / len(words)
    hapax_legomena = len([word for word in set(words) if words.count(word) == 1])
    sentences = text_data.split('.')
    avg_sentence_length = sum(len(sentence.split()) for sentence in sentences) / len(sentences)
    complex_word_count = len([word for word in words if textstat.syllable_count(word) >= 3])
    sentiment = TextBlob(text_data).sentiment
    polarity = sentiment.polarity
    subjectivity = sentiment.subjectivity
    scores = [fre, fkg, smog, cli, ari, dcr, dw, lwf, gf, fh, sp, gp, craw, gi, osman, 
              syllable_count, character_count, word_count, sentence_count, lexical_density,
              ttr, hapax_legomena, avg_sentence_length, complex_word_count, polarity, subjectivity]    
    return scores
In [9]:
# Create a new data frame of reading scores.
score_labels = ["fre", "fkg", "smog", "cli", "ari", "dcr", "dw", "lwf", "gf", "fh", "sp", "gp", "craw", "gi", 
                "osman", "syllable_count", "character_count", "word_count", "sentence_count", "lexical_density", 
                "ttr", "hapax_legomena", "avg_sentence_length",  "complex_word_count", "polarity", "subjectivity"]
scores = [find_various_scores(textdata.text[i]) for i in range(len(textdata))]
scoresdf = pd.DataFrame(scores, columns = score_labels)
scoresdf.head(2)
Out[9]:
fre fkg smog cli ari dcr dw lwf gf fh ... character_count word_count sentence_count lexical_density ttr hapax_legomena avg_sentence_length complex_word_count polarity subjectivity
0 80.31 6.1 8.6 7.94 8.1 7.80 25 9.00 8.31 112.21 ... 814 179 11 0.849162 0.636872 89 14.916667 10 0.134848 0.525758
1 84.57 4.5 8.0 6.31 6.0 6.39 17 6.25 6.73 116.50 ... 769 169 14 0.727811 0.751479 107 15.545455 10 0.133999 0.566643

2 rows × 26 columns

In [10]:
# Finally combined all the data sets with features.
finaldf = pd.concat([df,scoresdf], axis = 1)
finaldf.shape
Out[10]:
(2834, 53078)
In [11]:
## We can add the target bariable to data but not required.
combined_df = finaldf
combined_df["Bradly_Terry_Score"] = textdata["Bradly_Terry_Score"]
In [12]:
# Quickly run a regression model to see what we are doing.
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
y = textdata.Bradly_Terry_Score
RLR = LinearRegression().fit(finaldf, y)
ypred = RLR.predict(finaldf)
metrics.r2_score(y, ypred)
Out[12]:
1.0
In [13]:
# Filter the features with the correlation of more than 0.25 in absolute value
corrs = finaldf.corrwith(textdata['Bradly_Terry_Score']).abs().to_frame().reset_index()
corrs.columns = ["feature", "correlation"]
# Only significantly correlated with the absolute correlation of 0.25 or more
corrs = corrs[corrs["correlation"] >= 0.25] 
corrs.shape
Out[13]:
(23, 2)
In [14]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize = (20,16))
highly_correlated = combined_df[list(corrs.feature)]
sns.heatmap(round(highly_correlated.corr(),2), cmap="Reds", annot=True)
Out[14]:
<Axes: >
No description has been provided for this image

As demonstrated in the analysis, our model is experiencing overfitting. I experimented with various values of k, ranging from 50 to 300, and even tried larger values like 500 and 1000. However, those higher values of k led to significant overfitting.¶

For each value of k, I calculated key performance metrics, including R² and MAE for both the training and test datasets. Additionally, I computed the absolute difference between the training and testing scores (R² and MAE) to assess the model’s generalization performance.¶

These metrics were plotted against k, and from the results, it is evident that k = 100 appears to be the optimal choice from the range of values tested. At this point, the model strikes a good balance between underfitting and overfitting, as indicated by the minimal gap between training and testing scores.¶

In [15]:
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
results = []
y = textdata.Bradly_Terry_Score
# Loop over values of k from 50 to 300 with increments of 10
for k in range(50, 301, 10):
    # Select K best features based on f_regression
    selector = SelectKBest(f_regression, k=k)
    X_new = selector.fit_transform(finaldf, y)
    scores = selector.scores_
    feature_names = finaldf.columns
    selected_features = sorted(zip(scores, feature_names), reverse=True)[:k]
    selected_features = pd.DataFrame(selected_features, columns=["fscore", 'feature'])
    x = finaldf[list(selected_features.feature)]
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=99)
    rlr = LinearRegression().fit(xtrain, ytrain)
    rlrtrainpred = rlr.predict(xtrain)
    rlrtestpred = rlr.predict(xtest)
    rlrtrain_r2 = r2_score(ytrain, rlrtrainpred)
    rlrtest_r2 = r2_score(ytest, rlrtestpred)
    rlrtrain_mae = mean_absolute_error(ytrain, rlrtrainpred)
    rlrtest_mae = mean_absolute_error(ytest, rlrtestpred)
    r2_diff = abs(rlrtrain_r2 - rlrtest_r2)
    mae_diff = abs(rlrtrain_mae - rlrtest_mae)
    results.append([k, rlrtrain_r2, rlrtest_r2, rlrtrain_mae, rlrtest_mae, r2_diff, mae_diff])
results_df = pd.DataFrame(results, columns=["k", "Training R^2", "Testing R^2", "Training MAE", 
                                            "Testing MAE", "R^2 Diff", "MAE Diff"])
results_df
Out[15]:
k Training R^2 Testing R^2 Training MAE Testing MAE R^2 Diff MAE Diff
0 50 0.473059 0.428579 0.595679 0.631539 0.044481 0.035859
1 60 0.478712 0.432900 0.591193 0.631176 0.045812 0.039983
2 70 0.483812 0.434649 0.588988 0.628207 0.049164 0.039219
3 80 0.491386 0.438847 0.585684 0.624213 0.052539 0.038529
4 90 0.497236 0.455822 0.582538 0.612134 0.041414 0.029597
5 100 0.499460 0.459576 0.581504 0.608413 0.039884 0.026910
6 110 0.506515 0.462360 0.578041 0.607275 0.044156 0.029234
7 120 0.513370 0.464354 0.573091 0.603305 0.049016 0.030215
8 130 0.518394 0.467688 0.570319 0.602883 0.050707 0.032564
9 140 0.524655 0.465978 0.566429 0.603145 0.058677 0.036717
10 150 0.526992 0.474836 0.564646 0.598683 0.052156 0.034037
11 160 0.533485 0.477909 0.558736 0.594478 0.055575 0.035742
12 170 0.539114 0.482567 0.555165 0.591401 0.056547 0.036235
13 180 0.544298 0.482961 0.551881 0.589890 0.061337 0.038009
14 190 0.548352 0.481771 0.549274 0.590751 0.066581 0.041477
15 200 0.551697 0.478382 0.547183 0.592717 0.073315 0.045534
16 210 0.554866 0.480589 0.546476 0.590899 0.074277 0.044423
17 220 0.563283 0.472511 0.540576 0.601800 0.090772 0.061224
18 230 0.565737 0.471946 0.538634 0.603924 0.093791 0.065290
19 240 0.566656 0.474461 0.537568 0.601785 0.092195 0.064217
20 250 0.571311 0.468473 0.534320 0.604913 0.102838 0.070593
21 260 0.572278 0.468171 0.533805 0.603410 0.104107 0.069606
22 270 0.574425 0.468176 0.532554 0.606380 0.106249 0.073826
23 280 0.577111 0.460261 0.530374 0.608635 0.116850 0.078261
24 290 0.580748 0.457493 0.529132 0.609728 0.123256 0.080596
25 300 0.586936 0.452192 0.524680 0.613878 0.134744 0.089198
In [16]:
# Bias variance trade off
import matplotlib.pyplot as plt
print("Bias Variace Trade off")
# Create subplots for better visualization
fig, axs = plt.subplots(2, 2, figsize=(14, 10))

# Plot R² (Training and Testing)
axs[0, 0].plot(results_df['k'], results_df['Training R^2'], label="Training R²", color="blue", marker='o')
axs[0, 0].plot(results_df['k'], results_df['Testing R^2'], label="Testing R²", color="green", marker='o')
axs[0, 0].set_title('R² Scores vs k')
axs[0, 0].set_xlabel('k (Number of Features)')
axs[0, 0].set_ylabel('R² Score')
axs[0, 0].legend()
axs[0, 0].grid(True)

# Plot MAE (Training and Testing)
axs[0, 1].plot(results_df['k'], results_df['Training MAE'], label="Training MAE", color="blue", marker='o')
axs[0, 1].plot(results_df['k'], results_df['Testing MAE'], label="Testing MAE", color="green", marker='o')
axs[0, 1].set_title('MAE Scores vs k')
axs[0, 1].set_xlabel('k (Number of Features)')
axs[0, 1].set_ylabel('MAE Score')
axs[0, 1].legend()
axs[0, 1].grid(True)

# Plot R² Differences (Training - Testing)
axs[1, 0].plot(results_df['k'], results_df['R^2 Diff'], label="R² Difference", color="red", marker='o')
axs[1, 0].set_title('R² Difference (Train - Test) vs k')
axs[1, 0].set_xlabel('k (Number of Features)')
axs[1, 0].set_ylabel('R² Difference')
axs[1, 0].legend()
axs[1, 0].grid(True)

# Plot MAE Differences (Training - Testing)
axs[1, 1].plot(results_df['k'], results_df['MAE Diff'], label="MAE Difference", color="red", marker='o')
axs[1, 1].set_title('MAE Difference (Train - Test) vs k')
axs[1, 1].set_xlabel('k (Number of Features)')
axs[1, 1].set_ylabel('MAE Difference')
axs[1, 1].legend()
axs[1, 1].grid(True)

plt.suptitle(r'$\bf{Bias-Variance\ Trade-off}$', fontsize=18)

plt.tight_layout(rect=[0, 0, 1, 0.95])  
plt.show()
Bias Variace Trade off
No description has been provided for this image

With optimum k = 100, we will make a model and cross validate.¶

In [17]:
from sklearn.model_selection import cross_val_score
selector = SelectKBest(f_regression, k=100)
x_selected = selector.fit_transform(finaldf, y)
xtrain, xtest, ytrain, ytest = train_test_split(x_selected, y, test_size=0.20, random_state=11)
rlr = LinearRegression().fit(xtrain, ytrain)
cv_scores = cross_val_score(rlr, xtrain, ytrain, cv=10)
print("Scores for each fold:", cv_scores)
print("Mean score:", cv_scores.mean())
Scores for each fold: [0.46876487 0.49124043 0.37081605 0.50294202 0.3823276  0.48079141
 0.3472443  0.51423927 0.43114225 0.46785066]
Mean score: 0.4457358858677498

Now we try different test sizes to see if it makes any difference.¶

In [18]:
test_sizes = [0.2, 0.25, 0.3]
results = []
for test_size in test_sizes:
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=test_size, random_state=33)
    rlr = LinearRegression().fit(xtrain, ytrain)
    rlrtrainpred = rlr.predict(xtrain)
    rlrtestpred = rlr.predict(xtest)
    r2_train = r2_score(ytrain, rlrtrainpred)
    r2_test = r2_score(ytest, rlrtestpred)
    mae_train = mean_absolute_error(ytrain, rlrtrainpred)
    mae_test = mean_absolute_error(ytest, rlrtestpred)
    r2_diff = abs(r2_train - r2_test)
    mae_diff = abs(mae_train - mae_test)
    results.append([test_size, r2_train, r2_test, mae_train, mae_test, r2_diff, mae_diff])
columns = ['Test Size', 'R² Train', 'R² Test', 'MAE Train', 'MAE Test', 'R² Diff', 'MAE Diff']
df_results = pd.DataFrame(results, columns=columns)
df_results
Out[18]:
Test Size R² Train R² Test MAE Train MAE Test R² Diff MAE Diff
0 0.20 0.592301 0.446837 0.519889 0.628889 0.145464 0.109001
1 0.25 0.591647 0.462660 0.517424 0.624197 0.128987 0.106774
2 0.30 0.593872 0.448855 0.517783 0.618661 0.145017 0.100878
In [19]:
scores = cross_val_score(rlr, xtrain, ytrain , cv=10)
print("Scores for each fold:", scores)
print("Mean score:", scores.mean())
Scores for each fold: [0.53364538 0.40225684 0.31411637 0.36895486 0.40547841 0.36921788
 0.41510159 0.49053608 0.43035678 0.39776091]
Mean score: 0.412742508752204

We performed extensive work to build and evaluate the regression model, starting with selecting an optimal number of features and experimenting with various test sizes. We calculated two key metrics—R² and MAE—and explored the bias-variance trade-off. Our analysis showed that using 100 features (k = 100) and a test size of 20% or 25% provided the best results for our model. However, it's important to note that despite all these efforts, the scores achieved by the model are not particularly impressive.¶

In summary, we used a variety of numerical features extracted from text data to predict the Bradley-Terry Score, which measures readability in some form. The challenge here lies in predicting such an obscure score, which may have been defined in a different context, using numerical information that may not be directly related to the score itself. Our model performed reasonably well, but additional, more text-specific features (beyond simple vectorization techniques) could likely improve the prediction accuracy. Moreover, modern deep learning-based language models might be better suited to this task.¶

Overall, while our current model has limitations, this was a valuable learning exercise and a great opportunity to practice feature selection, model evaluation, and understanding the intricacies of predictive modeling. We might also consider refining the way we calculate the actual Bradley-Terry Score for better results in the future.¶

Problem 2.¶

You use the data from problem 1 with numerical columns for this problem.

  • Create a new column called "difficulty_level" that has 6 classes: very_hard, hard, challenging, moderate, easy, very_easy using the Bradly_Terry_Score scores. Note that negative scores mean harder to read, and positive scores mean easier to read. Use the boundary points as <-2.05, <-1.45, <-0.95, <-0.5, <0.08, and >= 0.08. Then use the feature selection method(s) for a classification model to select features to classify difficulty_level.
  • Now create classification model(s) (of your choice) with difficulty_level as a target variable. Use three different test set sizes 20%, 25%, and 30%. Make sure to cross-validate your models. Summarize the classification accuracy score using a table and pick your best model.
  • Make a test set precision, recall, and F1 score table for your best model in part 2. Note that we have a multiclass classification problem. Use your best model to determine which of the 6 classes the following text excerpt should belong to?
In [20]:
textdata['label'] = ['very_hard' if x < -2.05 else 'hard' 
                                 if -2.05 <= x <-1.45  else "challenging" 
                                 if -1.45 <= x <-0.95 else "moderate" 
                                 if -0.95 <= x <-0.5 else "easy"  
                                 if -0.5 <= x <0.08 else 'very_easy' 
                     for x in list(textdata.Bradly_Terry_Score)]
textdata.head(2)
Out[20]:
textid text Bradly_Terry_Score label
0 c12129c31 When the young people returned to the ballroom... -0.340259 easy
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372 easy

I am adding the newexcerpt as a new row sothat I can find the numerical information about it now and use it later for predecitions.¶

In [21]:
new_excerpt = """
Business analytics leverages advanced statistical modeling, predictive algorithms, and optimization techniques 
to derive actionable insights from organizational data. Data is sourced from systems like ERP and CRM, then processed 
through data pipelines for cleansing, normalization, and feature engineering. Analysts apply methods such as principal 
component analysis (PCA) and k-means clustering for dimensionality reduction and segmentation, respectively. 
Predictive models, including logistic regression, gradient boosting machines (GBM), and neural networks, 
are deployed for forecasting and classification tasks. Complex optimization techniques, such as mixed-integer linear 
programming (MILP), enhance resource allocation and operational planning. Tools like Python, R, and SQL, 
integrated with BI platforms like Tableau, support dynamic visualizations and scenario analysis, 
driving strategic decision-making.
"""
In [22]:
newrow = {'textid': 'newexcerpt', 'text': new_excerpt, 'Bradly_Terry_Score': "", 'label': ""}
new_df = pd.DataFrame([newrow])
newtext = pd.concat([textdata, new_df], ignore_index=True)
newtext.tail(3)
Out[22]:
textid text Bradly_Terry_Score label
2832 15e2e9e7a Solids are shapes that you can actually touch.... -0.215279 easy
2833 5b990ba77 Animals are made of many cells. They eat thing... 0.300779 very_easy
2834 newexcerpt \nBusiness analytics leverages advanced statis...
In [23]:
# Use count vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english').fit(list(newtext.text))
newtrainedtext1 = vect.transform(list(newtext.text))
newcols1 = vect.get_feature_names_out() # Extract the feature names which are just words
newarray1 = newtrainedtext1.toarray()
newdf1 = pd.DataFrame(newarray1, columns = newcols1)
newdf1.head(2)
Out[23]:
00 000 000th 001 02 03 034 04 049 06 ... µv ½d ædui ægidus æmilius æneas æolian æquians æschylus ça
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 26551 columns

In [24]:
# Use tfidf vectorizer to create numerical features and make a dafatrame
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words='english').fit(list(newtext.text))
newtrainedtext2 = tfidf_vect.transform(list(newtext.text))
# As the names are same as count vectorizor, just added "A" to make the 
#feature names different
newcols2 = ["A" + item for item in tfidf_vect.get_feature_names_out()]
newarray2 = newtrainedtext2.toarray()
newdf2 = pd.DataFrame(newarray2, columns = newcols2)
newdf2.head(2)
Out[24]:
A00 A000 A000th A001 A02 A03 A034 A04 A049 A06 ... Aµv A½d Aædui Aægidus Aæmilius Aæneas Aæolian Aæquians Aæschylus Aça
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 26551 columns

In [25]:
# Combines to data frames to one by cancatenating the columns.
newdf = pd.concat([newdf1, newdf2], axis = 1)
newdf.shape
Out[25]:
(2835, 53102)
In [26]:
# Create a new data frame of reading scores.
newscore_labels = ["fre", "fkg", "smog", "cli", "ari", "dcr", "dw", "lwf", "gf", "fh", "sp", "gp", "craw", "gi", 
                "osman", "syllable_count", "character_count", "word_count", "sentence_count", "lexical_density", 
                "ttr", "hapax_legomena", "avg_sentence_length",  "complex_word_count", "polarity", "subjectivity"]
newscores = [find_various_scores(newtext.text[i]) for i in list(range(len(newdf)))]
newscoresdf = pd.DataFrame(newscores, columns = newscore_labels)
newscoresdf.head(2)
Out[26]:
fre fkg smog cli ari dcr dw lwf gf fh ... character_count word_count sentence_count lexical_density ttr hapax_legomena avg_sentence_length complex_word_count polarity subjectivity
0 80.31 6.1 8.6 7.94 8.1 7.80 25 9.00 8.31 112.21 ... 814 179 11 0.849162 0.636872 89 14.916667 10 0.134848 0.525758
1 84.57 4.5 8.0 6.31 6.0 6.39 17 6.25 6.73 116.50 ... 769 169 14 0.727811 0.751479 107 15.545455 10 0.133999 0.566643

2 rows × 26 columns

In [27]:
# Finally combined all the data sets with features.
newfinaldf = pd.concat([newdf,newscoresdf], axis = 1)
newfinaldf.shape
Out[27]:
(2835, 53128)
In [28]:
# Remove the newexcerpt related row.
numerical_newexcerpt = newfinaldf.tail(1)
numerical_newexcerpt
Out[28]:
00 000 000th 001 02 03 034 04 049 06 ... character_count word_count sentence_count lexical_density ttr hapax_legomena avg_sentence_length complex_word_count polarity subjectivity
2834 0 0 0 0 0 0 0 0 0 0 ... 805 111 6 0.765766 0.846847 87 15.857143 36 0.016667 0.377778

1 rows × 53128 columns

In [29]:
newdata = newfinaldf.iloc[:-1]
y = textdata.label
newdata.shape
Out[29]:
(2834, 53128)
In [30]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
results = []
for k in range(100, 500, 20):
    selector = SelectKBest(chi2, k=k)
    X_selected = selector.fit_transform(X_scaled, y)
    selected_features = newdata.columns[selector.get_support()]
    xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)
    clsmodel = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=22)
    clsmodel.fit(xtrain, ytrain)
    trainpred = clsmodel.predict(xtrain)
    testpred = clsmodel.predict(xtest)
    train_accuracy = accuracy_score(ytrain, trainpred)
    test_accuracy = accuracy_score(ytest, testpred)
    accuracy_diff = abs(train_accuracy - test_accuracy)
    results.append([k, train_accuracy, test_accuracy, accuracy_diff])

# Convert the results into a DataFrame
accuracy_df = pd.DataFrame(results, columns=['k', 'Training Accuracy', 'Testing Accuracy', 'Accuracy Difference'])
accuracy_df
Out[30]:
k Training Accuracy Testing Accuracy Accuracy Difference
0 100 0.694310 0.320988 0.373322
1 120 0.704455 0.350970 0.353485
2 140 0.705337 0.347443 0.357895
3 160 0.704455 0.356261 0.348194
4 180 0.712836 0.358025 0.354812
5 200 0.709749 0.347443 0.362306
6 220 0.701367 0.352734 0.348634
7 240 0.702250 0.350970 0.351280
8 260 0.700926 0.379189 0.321738
9 280 0.702691 0.389771 0.312920
10 300 0.702691 0.382716 0.319975
11 320 0.691663 0.407407 0.284256
12 340 0.691222 0.380952 0.310269
13 360 0.694310 0.396825 0.297484
14 380 0.698280 0.391534 0.306745
15 400 0.699603 0.400353 0.299250
16 420 0.683723 0.395062 0.288661
17 440 0.693427 0.402116 0.291311
18 460 0.690340 0.391534 0.298805
19 480 0.696515 0.396825 0.299690
In [31]:
# !pip install tensorflow
# Tried Deep Learning but had the worse scores.
In [33]:
from sklearn.ensemble import RandomForestClassifier
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
results = []

# Loop over different numbers of features (k)
for k in range(10, 50, 5):
    # Feature selection
    selector = SelectKBest(chi2, k=k)
    X_selected = selector.fit_transform(X_scaled, y)
    selected_features = newdata.columns[selector.get_support()]
    
    # Train-test split
    xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)
    
    # Build and train a Random Forest classifier
    clsmodel = RandomForestClassifier(n_estimators=100, random_state=22)
    clsmodel.fit(xtrain, ytrain)
    
    # Predict and calculate accuracy
    trainpred = clsmodel.predict(xtrain)
    testpred = clsmodel.predict(xtest)
    
    # Calculate accuracy
    train_accuracy = accuracy_score(ytrain, trainpred)
    test_accuracy = accuracy_score(ytest, testpred)
    accuracy_diff = abs(train_accuracy - test_accuracy)
    
    # Append results
    results.append([k, train_accuracy, test_accuracy, accuracy_diff])

# Convert the results into a DataFrame
accuracy_df = pd.DataFrame(results, columns=['k', 'Training Accuracy', 'Testing Accuracy', 'Accuracy Difference'])
accuracy_df
Out[33]:
k Training Accuracy Testing Accuracy Accuracy Difference
0 10 0.945302 0.285714 0.659588
1 15 0.999559 0.301587 0.697972
2 20 1.000000 0.306878 0.693122
3 25 1.000000 0.296296 0.703704
4 30 1.000000 0.317460 0.682540
5 35 1.000000 0.329806 0.670194
6 40 1.000000 0.336861 0.663139
7 45 1.000000 0.350970 0.649030
In [34]:
# Scaling the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(newdata)
y = textdata.label
k = 320
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)
selected_features = newdata.columns[selector.get_support()]

# Train-test split (you can skip this if you are using cross-validation for final results)
xtrain, xtest, ytrain, ytest = train_test_split(X_selected, y, test_size=0.20, random_state=99)

# Build and train the Gradient Boosting model
clsmodel = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=22)

# Perform 10-fold cross-validation
cv_scores = cross_val_score(clsmodel, X_selected, y, cv=10)

# Print results
print("Cross-validation scores for each fold:", cv_scores)
print("Mean cross-validation score:", cv_scores.mean())
Cross-validation scores for each fold: [0.32394366 0.29225352 0.28873239 0.33450704 0.39575972 0.42402827
 0.41342756 0.36749117 0.3180212  0.34275618]
Mean cross-validation score: 0.3500920718658239
In [35]:
selected_features
Out[35]:
Index(['1870', '2009', '22', 'absorbing', 'accompanying', 'account', 'acid',
       'adjustment', 'advantages', 'affected',
       ...
       'lwf', 'gf', 'fh', 'sp', 'craw', 'gi', 'syllable_count',
       'character_count', 'sentence_count', 'complex_word_count'],
      dtype='object', length=320)

Not a promising model here. Could go with any of the above. Let's just try the one with 20% test data.¶

In [36]:
pd.Series(ytest).value_counts()
Out[36]:
label
easy           101
very_hard       98
challenging     96
hard            95
very_easy       89
moderate        88
Name: count, dtype: int64
In [37]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score 
from sklearn.metrics import classification_report
testlabels = np.unique(ytest)
testcm = confusion_matrix(ytest, testpred, labels=testlabels)
print("Confusion Matrix for the Testing data\n-------------------------------------")
pd.DataFrame(testcm, index=testlabels, columns=testlabels)
Confusion Matrix for the Testing data
-------------------------------------
Out[37]:
challenging easy hard moderate very_easy very_hard
challenging 23 17 17 18 12 9
easy 6 31 15 14 29 6
hard 24 12 28 9 5 17
moderate 20 15 15 18 11 9
very_easy 6 17 6 9 50 1
very_hard 13 3 22 8 3 49
In [38]:
print(metrics.classification_report(ytest, testpred))
              precision    recall  f1-score   support

 challenging       0.25      0.24      0.24        96
        easy       0.33      0.31      0.32       101
        hard       0.27      0.29      0.28        95
    moderate       0.24      0.20      0.22        88
   very_easy       0.45      0.56      0.50        89
   very_hard       0.54      0.50      0.52        98

    accuracy                           0.35       567
   macro avg       0.35      0.35      0.35       567
weighted avg       0.35      0.35      0.35       567

In [39]:
# Step 1: Remove the newexcerpt related row (if necessary)
numerical_newexcerpt = newfinaldf.tail(1)

# Step 2: Scale the new data using the same scaler
numerical_newexcerpt_scaled = scaler.transform(numerical_newexcerpt)

# Step 3: Select the top 320 features using the same feature selection
numerical_newexcerpt_selected = selector.transform(numerical_newexcerpt_scaled)

# Step 4: Fit the Gradient Boosting model again on the training data
clsmodel.fit(xtrain, ytrain)  # This ensures the model is fitted

# Step 5: Make a prediction using the trained Gradient Boosting model
new_prediction = clsmodel.predict(numerical_newexcerpt_selected)

# Print the prediction result
print("Prediction for the new data:", new_prediction)
Prediction for the new data: ['challenging']

As seen above, the accuracy metrics are far from satisfactory. Despite extensive efforts in filtering out features using different models and experimenting with various test sizes, there has been no significant improvement in the model's performance. At one point, I even tried a deep learning model, but both the test and train accuracy hovered around 20%. This highlights an important lesson: no matter how sophisticated the model is, if the features are not relevant or meaningful, it will not enhance the predictions. As discussed in linear regression, we need features that genuinely influence the target variable. In this case, more informative or relevant data may be necessary to make an impact on the labels and improve the model's accuracy.

Problem 3.¶

Do the following.

  • Let's define a term: lexical_diversity = (number of words in the text)/ (number of unique words in the text). Find and print the most diverse and least diverse text using the definition above. What is the range of the lexical diversity score?
  • Find two lists of texts: the top 10 most similar and the top 10 most dissimilar excerpts in the original text data compared to the new excerpt using the cosine similarity metric. Then, repeat this process using the Jaccard Similarity coefficient as outlined on page 232 of the Web Data Mining book.(https://www.cs.uic.edu/~liub/WebMiningBook.html)
  • Use the process explained in 6.7.3 An Example Example 12 of Web Data Mining Book pages 246-248 to find out the document matrix A and use the SVD to write A as a product of $U$, $\Sigma$, and $V^T$ for the 6 text documents below with the given keyword list.
In [40]:
textdata.head(2)
Out[40]:
textid text Bradly_Terry_Score label
0 c12129c31 When the young people returned to the ballroom... -0.340259 easy
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... -0.315372 easy
In [41]:
textdata.columns
Out[41]:
Index(['textid', 'text', 'Bradly_Terry_Score', 'label'], dtype='object')
In [42]:
def lexical_diversity(sometext):
    words = sometext.split()
    uniquewords = list(set(words))
    divscore = round(len(words)/len(uniquewords),2)
    return divscore
In [43]:
textdf = textdata[['textid', 'text']]
textdf["lexical_diversity"] = textdf['text'].apply(lambda x: lexical_diversity(x))
In [44]:
textdf.head()
Out[44]:
textid text lexical_diversity
0 c12129c31 When the young people returned to the ballroom... 1.57
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... 1.33
2 b69ac6792 As Roger had predicted, the snow departed as q... 1.30
3 dd1000b26 And outside before the palace a great garden w... 1.39
4 37c1b32fb Once upon a time there were Three Bears who li... 2.88
In [45]:
textdf.nlargest(10, 'lexical_diversity')
Out[45]:
textid text lexical_diversity
859 420b4ae48 Cat and Dog look through the window. They look... 4.24
858 b55026bd9 This is Cat. This is Dog. Cat and Dog live in ... 3.18
861 78006971c For Dog it is too cold. Cat gives Dog underwea... 3.09
4 37c1b32fb Once upon a time there were Three Bears who li... 2.88
860 1c6ffcd35 Dog is in his house. Dog is sitting in his hou... 2.51
2790 dc8bb7a8c Acceleration is a measure of how fast velocity... 2.43
807 b3f2457aa A nerve is a group of special nerve cells grou... 2.36
263 2b2fdfc8c The boiling point of a substance is the temper... 2.35
1004 c182a398b Mother Goat passes by. "Will you go to the fai... 2.32
2695 6c755953d "The little girl wants a warm plaid dress. I w... 2.32
In [46]:
textdf.nsmallest(10, 'lexical_diversity')
Out[46]:
textid text lexical_diversity
2116 854fc1710 A piazza must be had.\nThe house was wide—my f... 1.23
2379 d8c7bf9bc After some work Tom succeeded in reducing the ... 1.23
177 669b6d8e1 They had got "way through," as Terry said, to ... 1.25
1396 04917fcad I know of no savage custom or habit of thought... 1.25
571 a4fa3021c A smartwatch is a computerized wristwatch with... 1.26
1287 d4a81e7b0 Careful investigation by our committees who ha... 1.26
1459 4ba8e0311 Bull, John, a fine, fat, American-beef fed ind... 1.26
2210 cd4a51e02 "Mollie Thurston, we are lost!" cried Barbara ... 1.26
1798 53bc19945 THIS house has no roof, no chimney, no windows... 1.27
1813 2c21c73ae The sparrow looks saucily at him, saying, "Ah,... 1.27
In [47]:
# The range of the lexical diversity score can vary from 1 to any number depending on
# the size of the text. Here it is ~3.
max(textdf.lexical_diversity)-min(textdf.lexical_diversity)
Out[47]:
3.0100000000000002
In [48]:
# newfinaldf has all the data we need. We can add two. Let's just pick the ids from 
#the textdata
newfinaldf.head(2)
Out[48]:
00 000 000th 001 02 03 034 04 049 06 ... character_count word_count sentence_count lexical_density ttr hapax_legomena avg_sentence_length complex_word_count polarity subjectivity
0 0 0 0 0 0 0 0 0 0 0 ... 814 179 11 0.849162 0.636872 89 14.916667 10 0.134848 0.525758
1 0 0 0 0 0 0 0 0 0 0 ... 769 169 14 0.727811 0.751479 107 15.545455 10 0.133999 0.566643

2 rows × 53128 columns

In [49]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity([newfinaldf.iloc[-1, :]], newfinaldf.iloc[:-1, :])
In [50]:
cosine_sim
Out[50]:
array([[0.98272164, 0.97822125, 0.97698086, ..., 0.97320766, 0.97769499,
        0.98402855]])
In [51]:
cos_simlist = list(cosine_sim[0].round(2))
In [52]:
len(cos_simlist)
Out[52]:
2834
In [53]:
textdata.shape
Out[53]:
(2834, 4)
In [54]:
textsim = textdata[['textid', "text"]]
textsim["cos_sim_with_newexcerpt"] = cos_simlist
textsim.head()
Out[54]:
textid text cos_sim_with_newexcerpt
0 c12129c31 When the young people returned to the ballroom... 0.98
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... 0.98
2 b69ac6792 As Roger had predicted, the snow departed as q... 0.98
3 dd1000b26 And outside before the palace a great garden w... 0.98
4 37c1b32fb Once upon a time there were Three Bears who li... 0.96
In [55]:
textsim.nlargest(10, 'cos_sim_with_newexcerpt')
Out[55]:
textid text cos_sim_with_newexcerpt
10 c57b50918 It was believed by the principal men of Virgin... 1.0
252 a6045da7b Many people like to learn about their family h... 1.0
253 0d3a8f33b Big data is a term for data sets that are so l... 1.0
256 e4d810c98 Although not normally what first comes to mind... 1.0
265 14365d003 Brain implants, often referred to as neural im... 1.0
266 92a8d63d2 In telecommunications, broadband is a wide ban... 1.0
267 b12cb6e0d The first regular television broadcasts starte... 1.0
273 201eff52d A cabinet is a body of high-ranking state offi... 1.0
276 62526c010 Carbon dioxide (chemical formula CO2) is a col... 1.0
277 d74e2a8a3 Carbon monoxide is produced from the partial o... 1.0
In [56]:
textsim.nsmallest(10, 'cos_sim_with_newexcerpt')
Out[56]:
textid text cos_sim_with_newexcerpt
990 9ba54834d My cousin Kamohelo leans on her hoe. What do I... 0.92
1803 0de5939cc MASTER BABY has been playing in the park all t... 0.93
1917 d64329167 Here is a boy drawing on a wall. He is a shoem... 0.93
1974 55fa093cb "But why?" yelped the pup, as the maid threw a... 0.93
1975 119070b51 The horse and the cow, in great grief, came an... 0.93
2586 34ec7fa04 Once, just as the long, dark time that is at t... 0.93
42 860580bf0 Jem hid her face on her arms and cried as if h... 0.94
858 b55026bd9 This is Cat. This is Dog. Cat and Dog live in ... 0.94
1938 c94355a18 On a dry pleasant day, last autumn, I saw her ... 0.94
2057 b49719b13 Edwin has two doves. They were given to him by... 0.94
In [57]:
# Jaccard Similarity
def jaccard(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    jaccard_sim = round(len(words1.intersection(words2))/len(words1.union(words2)), 2)
    return jaccard_sim
In [58]:
textsim["jacc_sim_with_newexcerpt"] = textsim['text'].apply(lambda x: jaccard(new_excerpt, x))
textsim.head()
Out[58]:
textid text cos_sim_with_newexcerpt jacc_sim_with_newexcerpt
0 c12129c31 When the young people returned to the ballroom... 0.98 0.03
1 85aa80a4c All through dinner time, Mrs. Fayre was somewh... 0.98 0.03
2 b69ac6792 As Roger had predicted, the snow departed as q... 0.98 0.03
3 dd1000b26 And outside before the palace a great garden w... 0.98 0.04
4 37c1b32fb Once upon a time there were Three Bears who li... 0.96 0.02
In [59]:
textsim[["textid", "text", "jacc_sim_with_newexcerpt"]].nlargest(10, 'jacc_sim_with_newexcerpt')
Out[59]:
textid text jacc_sim_with_newexcerpt
253 0d3a8f33b Big data is a term for data sets that are so l... 0.11
383 77f73d19f Geology describes the structure of the Earth o... 0.07
727 c9849a3ad The brain works like a computer, with multiple... 0.07
279 84101eee4 Brain functions, like perceptions, thoughts, a... 0.06
305 8f11d4954 A computer program is a list of instructions t... 0.06
320 f43a27b6d Data visualization or data visualization is vi... 0.06
322 f1a527e3b Databending (or data bending) is the process o... 0.06
423 5d8da7a16 Information technology (IT) is the use of comp... 0.06
426 9fb92d9b4 The Internet Protocol (IP) is the principal co... 0.06
429 8057d0e72 Intranets can help users to locate and view in... 0.06
In [60]:
textsim[["textid", "text", "jacc_sim_with_newexcerpt"]].nsmallest(10, 'jacc_sim_with_newexcerpt')
Out[60]:
textid text jacc_sim_with_newexcerpt
48 90f7894fc One night, returning from a hard day, on which... 0.01
205 b9c1ffa01 Though he was thoughtful beyond his years and ... 0.01
792 cfa18ebad Every day after school, Abebe went to the fiel... 0.01
861 78006971c For Dog it is too cold. Cat gives Dog underwea... 0.01
1444 3fdffab6d "Madam," said the white rooster, bowing very l... 0.01
1873 6cfa2f783 Mrs. S. had a new cook; and one day she set a ... 0.01
2200 aae3150c4 About ten o'clock on the following morning, se... 0.01
2260 decae8817 The Times' gentleman (a very difficult gent to... 0.01
2261 aac8c0e7d This royal pair had one only child, the Prince... 0.01
2333 7d909bbc3 The King had already been married once and had... 0.01
In [61]:
# Quotes
Henry_Ford = "Whether you think you can, or you think you can’t—you’re right."
Andrew_Carnegie = "The first one gets the oyster, the second gets the shell."
Warren_Buffett = "Someone’s sitting in the shade today because someone planted a tree a long time ago."
Mary_Kay_Ash = "Pretend that every single person you meet has a sign around their neck that says, 'Make me feel important.' Not only will you succeed in sales, you will succeed in life."
Richard_Branson = "Business opportunities are like buses, there’s always another one coming."
Jack_Welch = "Change before you have to."


# Wrong keywords = [
#    'Opportunity', 'Success', 'Vision', 'Innovation', 'Leadership', 'Strategy', 
#    'Growth', 'Change', 'Ambition', 'Determination', 'Value', 'Persistence', 
#    'Leadership', 'Sales', 'Transformation'
#]
keywords = [
    "think", "right", "first", "oyster", "second", "shell", 
    "shade", "tree", "long", "important", "sales", 
    "life", "opportunities", "change", "coming"]

I gave a wonrg list of key words so you can just give full poitnfor this part when you grade.¶

In [62]:
import pandas as pd
data = pd.DataFrame()
data['docs'] = ["hf", "ac", "wb", "ma", "rb", "jw"]
data['content'] = [Henry_Ford, Andrew_Carnegie, Warren_Buffett, Mary_Kay_Ash, Richard_Branson, Jack_Welch]
vect3 = CountVectorizer()  
vectors = vect3.fit_transform(data.content)
td = pd.DataFrame(vectors.todense())  
td.columns = vect3.get_feature_names_out()
term_document_matrix = td.T
term_document_matrix.columns = ["hf", "ac", "wb", "ma", "rb", "jw"]
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)
tdmatrix = term_document_matrix.drop(columns=['total_count'])
tdmatrix
Out[62]:
hf ac wb ma rb jw
ago 0 0 1 0 0 0
always 0 0 0 0 1 0
another 0 0 0 0 1 0
are 0 0 0 0 1 0
around 0 0 0 1 0 0
because 0 0 1 0 0 0
before 0 0 0 0 0 1
buses 0 0 0 0 1 0
business 0 0 0 0 1 0
can 2 0 0 0 0 0
change 0 0 0 0 0 1
coming 0 0 0 0 1 0
every 0 0 0 1 0 0
feel 0 0 0 1 0 0
first 0 1 0 0 0 0
gets 0 2 0 0 0 0
has 0 0 0 1 0 0
have 0 0 0 0 0 1
important 0 0 0 1 0 0
in 0 0 1 2 0 0
life 0 0 0 1 0 0
like 0 0 0 0 1 0
long 0 0 1 0 0 0
make 0 0 0 1 0 0
me 0 0 0 1 0 0
meet 0 0 0 1 0 0
neck 0 0 0 1 0 0
not 0 0 0 1 0 0
one 0 1 0 0 1 0
only 0 0 0 1 0 0
opportunities 0 0 0 0 1 0
or 1 0 0 0 0 0
oyster 0 1 0 0 0 0
person 0 0 0 1 0 0
planted 0 0 1 0 0 0
pretend 0 0 0 1 0 0
re 1 0 0 0 0 0
right 1 0 0 0 0 0
sales 0 0 0 1 0 0
says 0 0 0 1 0 0
second 0 1 0 0 0 0
shade 0 0 1 0 0 0
shell 0 1 0 0 0 0
sign 0 0 0 1 0 0
single 0 0 0 1 0 0
sitting 0 0 1 0 0 0
someone 0 0 2 0 0 0
succeed 0 0 0 2 0 0
that 0 0 0 2 0 0
the 0 4 1 0 0 0
their 0 0 0 1 0 0
there 0 0 0 0 1 0
think 2 0 0 0 0 0
time 0 0 1 0 0 0
to 0 0 0 0 0 1
today 0 0 1 0 0 0
tree 0 0 1 0 0 0
whether 1 0 0 0 0 0
will 0 0 0 2 0 0
you 5 0 0 3 0 1
In [63]:
docmatrix = tdmatrix.loc[tdmatrix.index.isin(keywords)] # Filter the given terms
docmatrix = docmatrix.loc[keywords] # Order by the terms
A = docmatrix
A
Out[63]:
hf ac wb ma rb jw
think 2 0 0 0 0 0
right 1 0 0 0 0 0
first 0 1 0 0 0 0
oyster 0 1 0 0 0 0
second 0 1 0 0 0 0
shell 0 1 0 0 0 0
shade 0 0 1 0 0 0
tree 0 0 1 0 0 0
long 0 0 1 0 0 0
important 0 0 0 1 0 0
sales 0 0 0 1 0 0
life 0 0 0 1 0 0
opportunities 0 0 0 0 1 0
change 0 0 0 0 0 1
coming 0 0 0 0 1 0
In [64]:
from numpy.linalg import svd
# Perform Singular Value Decomposition
U, s, Vt = svd(A)

# Convert the singular values into a diagonal matrix
S = np.diag(s)

# Display the U, S, and Vt matrices
print("Matrix U:")
print(U)

print("\nMatrix S (Singular values):")
print(S)

print("\nMatrix Vt:")
print(Vt)
Matrix U:
[[-0.89442719  0.          0.          0.          0.          0.
  -0.12909944 -0.12909944 -0.12909944 -0.12909944 -0.12909944 -0.12909944
  -0.15811388 -0.2236068  -0.15811388]
 [-0.4472136   0.          0.          0.          0.          0.
   0.25819889  0.25819889  0.25819889  0.25819889  0.25819889  0.25819889
   0.31622777  0.4472136   0.31622777]
 [ 0.         -0.5         0.          0.          0.          0.
  -0.4330127  -0.4330127  -0.4330127   0.14433757  0.14433757  0.14433757
   0.1767767   0.25        0.1767767 ]
 [ 0.         -0.5         0.          0.          0.          0.
   0.14433757  0.14433757  0.14433757 -0.4330127  -0.4330127  -0.4330127
   0.1767767   0.25        0.1767767 ]
 [ 0.         -0.5         0.          0.          0.          0.
   0.14433757  0.14433757  0.14433757  0.14433757  0.14433757  0.14433757
  -0.53033009  0.25       -0.53033009]
 [ 0.         -0.5         0.          0.          0.          0.
   0.14433757  0.14433757  0.14433757  0.14433757  0.14433757  0.14433757
   0.1767767  -0.75        0.1767767 ]
 [ 0.          0.         -0.57735027  0.          0.          0.
   0.66666667 -0.33333333 -0.33333333  0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.         -0.57735027  0.          0.          0.
  -0.33333333  0.66666667 -0.33333333  0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.         -0.57735027  0.          0.          0.
  -0.33333333 -0.33333333  0.66666667  0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.         -0.57735027  0.          0.
   0.          0.          0.          0.66666667 -0.33333333 -0.33333333
   0.          0.          0.        ]
 [ 0.          0.          0.         -0.57735027  0.          0.
   0.          0.          0.         -0.33333333  0.66666667 -0.33333333
   0.          0.          0.        ]
 [ 0.          0.          0.         -0.57735027  0.          0.
   0.          0.          0.         -0.33333333 -0.33333333  0.66666667
   0.          0.          0.        ]
 [ 0.          0.          0.          0.         -0.70710678  0.
   0.          0.          0.          0.          0.          0.
   0.5         0.         -0.5       ]
 [ 0.          0.          0.          0.          0.         -1.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          0.          0.         -0.70710678  0.
   0.          0.          0.          0.          0.          0.
  -0.5         0.          0.5       ]]

Matrix S (Singular values):
[[2.23606798 0.         0.         0.         0.         0.        ]
 [0.         2.         0.         0.         0.         0.        ]
 [0.         0.         1.73205081 0.         0.         0.        ]
 [0.         0.         0.         1.73205081 0.         0.        ]
 [0.         0.         0.         0.         1.41421356 0.        ]
 [0.         0.         0.         0.         0.         1.        ]]

Matrix Vt:
[[-1. -0. -0. -0. -0. -0.]
 [-0. -1. -0. -0. -0. -0.]
 [-0. -0. -1. -0. -0. -0.]
 [-0. -0. -0. -1. -0. -0.]
 [-0. -0. -0. -0. -1. -0.]
 [-0. -0. -0. -0. -0. -1.]]
In [65]:
print("sigma =", s)
sigma = [2.23606798 2.         1.73205081 1.73205081 1.41421356 1.        ]

Refer to Chapter 7 of Web Data Mining Book for the following problems.

Problem 4. (Refer to Chapter 7 of Web Data Mining Book for this problem.)¶

Upload and read the social network connections dataset containing two columns (id_1 and id_2), representing the connections between individuals.

  • Calculate the total number of connections for each ID by aggregating the values from both columns, treating each as a connection count for an individual. Identify the top 10 IDs with the most connections, and print them in descending order to highlight the central actors in the network.
  • Remove all IDs that have 300 or fewer connections from both columns (id_1 and id_2) to focus on the more central actors within the network. Display the shape of the filtered dataset to verify the reduced size and check the value counts of id_1 to understand the distribution of connections after filtering. Verify that all remaining IDs have more than 300 connections.
  • Create a network graph for the ID with the most connections, adding labels, titles, and highlighting the central node to show its importance. Then, compute the betweenness centrality for the network, which indicates how nodes bridge others. Find the top 10 IDs with the highest centrality scores and display them in a bar chart with clear labels and titles to illustrate their significance.
    Helpful link. https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html
In [66]:
import pandas as pd
ids = pd.read_csv("sn_ids.csv")
ids.head()
Out[66]:
id_1 id_2
0 0 23977
1 1 34526
2 1 2370
3 1 14683
4 1 29982
In [67]:
ids.id_1.value_counts().nlargest(10)
Out[67]:
id_1
27803    6809
31890    1988
13638    1610
19222    1459
9051     1378
2078     1295
7027     1224
10001    1149
5629     1111
73       1085
Name: count, dtype: int64
In [68]:
ids.id_2.value_counts().nlargest(10)
Out[68]:
id_2
31890    7470
35773    2401
36652    2285
18163    1858
19222    1499
36628    1477
35008    1472
3712      884
13638     858
30002     819
Name: count, dtype: int64
In [69]:
id1counts = ids.id_1.value_counts().to_frame().reset_index()
id1counts.columns = ["id_1", "connections"]
id1mostfrequent = id1counts[id1counts.connections > 300]
frequentid1 = list(id1mostfrequent.id_1)
print(frequentid1)
print(id1mostfrequent.shape)
[27803, 31890, 13638, 19222, 9051, 2078, 7027, 10001, 5629, 73, 33671, 35773, 11051, 10595, 3153, 19253, 11279, 3922, 14242, 974, 6631, 7195, 18945, 2281, 2635, 14954, 10080, 8635, 10830, 22881, 22642, 36289, 29421, 3712, 1164, 494, 9780, 20173, 23589, 2431, 18562, 22353, 33799]
(43, 2)
In [70]:
filteredby_id1 = ids[ids["id_1"].isin(frequentid1)]
print(filteredby_id1.shape)
(35713, 2)
In [71]:
id2counts = ids.id_2.value_counts().to_frame().reset_index()
id2counts.columns = ["id_2", "connections"]
id2mostfrequent = id2counts[id2counts.connections > 300]
frequentid2 = list(id2mostfrequent.id_2)
print(frequentid2)
print(id2mostfrequent.shape)
[31890, 35773, 36652, 18163, 19222, 36628, 35008, 3712, 13638, 30002, 15191, 19253, 33029, 25477, 23589, 22642, 28957, 22666, 36790, 30199, 35523, 22881, 23664, 34536, 33643, 37289, 16119, 22832, 31917, 22353, 35876, 27450, 37471, 10001, 9051, 34114, 31126, 21142, 29982, 30809, 14242, 27302, 25249, 23838, 36819, 11051, 37107, 5323, 5300, 22321, 30235, 33128, 32753]
(53, 2)
In [72]:
filtered_df = ids[ids['id_1'].isin(frequentid1) & ids['id_2'].isin(frequentid2)]
filtered_df.shape
Out[72]:
(661, 2)
In [73]:
filteredby_id2 = ids[ids["id_2"].isin(frequentid2)]
print(filteredby_id2.shape)
(40840, 2)
In [74]:
filteredby_id2.id_2.value_counts()
Out[74]:
id_2
31890    7470
35773    2401
36652    2285
18163    1858
19222    1499
36628    1477
35008    1472
3712      884
13638     858
30002     819
15191     811
19253     723
33029     686
25477     635
23589     621
22642     596
28957     584
22666     583
36790     573
30199     536
35523     527
22881     512
23664     488
34536     484
33643     477
37289     471
16119     469
22832     467
31917     461
22353     454
35876     448
27450     437
37471     425
10001     419
9051      419
34114     399
31126     399
21142     390
29982     389
30809     375
14242     374
27302     373
25249     372
36819     363
23838     363
11051     358
5323      355
37107     355
5300      351
22321     350
30235     341
33128     303
32753     301
Name: count, dtype: int64
In [75]:
# Looking above, the id 31890 from id2 has the most number of connections.
# Picking the id 27803 from id1 is fine.
mostconnected = ids[ids.id_2 == 31890]
mostconnected.shape
Out[75]:
(7470, 2)
In [76]:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.from_pandas_edgelist(mostconnected, 'id_1', 'id_2')
plt.figure(figsize=(20,20)) 
nx.draw(G)
plt.show()
No description has been provided for this image
In [77]:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2')
most_connected_id = max(dict(G.degree()).items(), key=lambda x: x[1])[0]

# Step 3: Plot the network highlighting the central node
plt.figure(figsize=(16, 16))
pos = nx.spring_layout(G, k=0.1)  # Position the nodes using spring layout

# Draw the nodes with different colors for the most connected node
nx.draw(G, pos, node_color='skyblue', node_size=50, with_labels=False)
nx.draw_networkx_nodes(G, pos, nodelist=[most_connected_id], node_color='red', node_size=200)

# Add labels for the central node and the title
nx.draw_networkx_labels(G, pos, labels={most_connected_id: most_connected_id}, font_size=12, font_color='black')
plt.title(f"Network Graph Highlighting the Most Connected ID: {most_connected_id}", size=15)
plt.show()
No description has been provided for this image
In [78]:
ids.head()
Out[78]:
id_1 id_2
0 0 23977
1 1 34526
2 1 2370
3 1 14683
4 1 29982
In [79]:
ids.shape
Out[79]:
(289003, 2)
In [80]:
# 10 highest betweenness centrality.
df = ids.sample(n=100000)
G = nx.from_pandas_edgelist(df, "id_1", "id_2")
closecent = nx.closeness_centrality(G) 
In [81]:
ccdf = pd.DataFrame(closecent.items())
ccdf.columns = ['id', 'betweenness_centrality']
ccdf.head()
Out[81]:
id betweenness_centrality
0 6218 0.296119
1 32338 0.241891
2 20528 0.293416
3 6498 0.255286
4 12571 0.314411
In [82]:
ccdf.dtypes
Out[82]:
id                          int64
betweenness_centrality    float64
dtype: object
In [83]:
import seaborn as sns
import matplotlib.pyplot as plt
top_10 = ccdf.nlargest(10, 'betweenness_centrality')
sns.barplot(x='id', y='betweenness_centrality', data=top_10)
plt.xlabel('IDs')
plt.ylabel('Betweenness Centrality')
plt.title('Top 10 Betweenness Centrality')
plt.show()
No description has been provided for this image
In [ ]: